Bootstrap Approximation of Model Selection Probabilities for Multimodel Inference Frameworks

Most statistical modeling applications involve the consideration of a candidate collection of models based on various sets of explanatory variables. The candidate models may also differ in terms of the structural formulations for the systematic component and the posited probability distributions for the random component. A common practice is to use an information criterion to select a model from the collection that provides an optimal balance between fidelity to the data and parsimony. The analyst then typically proceeds as if the chosen model was the only model ever considered. However, such a practice fails to account for the variability inherent in the model selection process, which can lead to inappropriate inferential results and conclusions. In recent years, inferential methods have been proposed for multimodel frameworks that attempt to provide an appropriate accounting of modeling uncertainty. In the frequentist paradigm, such methods should ideally involve model selection probabilities, i.e., the relative frequencies of selection for each candidate model based on repeated sampling. Model selection probabilities can be conveniently approximated through bootstrapping. When the Akaike information criterion is employed, Akaike weights are also commonly used as a surrogate for selection probabilities. In this work, we show that the conventional bootstrap approach for approximating model selection probabilities is impacted by bias. We propose a simple correction to adjust for this bias. We also argue that Akaike weights do not provide adequate approximations for selection probabilities, although they do provide a crude gauge of model plausibility.


Introduction
When presented with a set of variables posited as potential predictors of a targeted outcome, researchers harness an array of techniques to construct an appropriate descriptive or predictive model.Historically, the conventional approach entailed the formulation of a model under the guidance of an expert, who, grounded in a scientific understanding of the underlying mechanics of the observed outcome, would advocate for the structure of the proposed model.In contemporary practice, while domain expertise continues to inform the compilation of potential explanatory variables, the adoption of an all-encompassing model incorporating all of these variables is rare.Instead, scientists and statisticians have gravitated towards model selection algorithms (e.g., best subsets selection, forward selection, backward elimination, stepwise selection, and the LASSO), which utilize the sample at hand to yield models exhibiting more parsimonious structures than the comprehensive global model.
Once a model is selected, the common practice is to conduct inference on that model, proceeding as if this were the only model to ever be considered.However, different samples from the same population may lead to a selection of models with different structures.Therefore, the standard practice neglects the sampling variability inherent in the selection process.This pervasive issue in contemporary applications of statistics was described by Breiman as a "quiet scandal in the statistical community" [1].
A direct consequence of this oversight is the inherent difficulty in replicating the results of a modeling analysis using subsequent samples [2,3].Therefore, inferences contingent on the chosen model are profoundly impacted, raising concerns regarding the bias of regression effect estimators, the accuracy of their estimated standard errors, and the validity of associated p-values or confidence intervals [3].For instance, failing to account for model selection variability will typically result in smaller standard errors, which in turn results in erroneously smaller p-values and overly optimistic, narrower confidence intervals.
In addition, common model selection procedures are prone to including spurious effects in the final model [4,5].In the setting of regression models with correlated variables, the inclusion of spurious effects can heavily influence the estimates and the interpretation of important effects.For example, if two explanatory variables are correlated, their estimated effects can vastly vary for models that only include one of the variables as opposed to models that include both variables [3].
In recent years, various approaches have been proposed that are aimed at addressing the estimation of regression effects while accounting for model selection variability.These approaches can often be characterized as multimodel inference procedures, where regression effect estimates are not solely derived from a single model but rather from the entire collection of models.Conceptually, such procedures are aligned with ensemble methods, where the final prediction or classification is derived through the combination of multiple submodels.
The foundational principles of multimodel inference are influenced by Bayesian principles, particularly the concept of Bayesian model averaging (BMA).The BMA framework is founded on the recognition that different estimates are contingent upon specific models, and that these models possess varying probabilities of arising, given the observed data.Consequently, the posterior mean of any quantity of interest should encompass contributions from all the models based on their posterior probabilities.In essence, the posterior mean of the target quantity should be an average of the estimates conditional on each model, weighted by the posterior probability of that particular model.
In the frequentist domain, several multimodel inference methodologies mirror the BMA framework.A prominent approach involves the utilization of Akaike weights to play the role of model probabilities [6].These Akaike weights offer a measure of evidence for each model in a candidate set, allowing us to compute the estimate of each regression coefficient by aggregating their Akaike-weighted model-specific estimates.In this manner, a model averaging calculation akin to BMA is performed but with the incorporation of each model's Akaike weight as opposed to its posterior probability.
The basis of incorporating Akaike weights into the multimodel inference framework presupposes the validity of Akaike weights as approximations to model selection probabilities.In fact, Burnham and Anderson claim that Akaike weights "may be interpreted as the probability that model i is the actual expected K-L [Kullback-Leibler] best model for the sampling situation considered" [6].However, in this paper, we argue that Akaike weights do not provide adequate approximations to model probabilities.Instead, we demonstrate that repeating the model selection process via the bootstrap yields better approximations to model selection probabilities, which can then be incorporated for the purpose of conducting multimodel inference.Yet this bootstrap procedure must be implemented with caution because of the bias generated from bootstrapping the likelihood-based information criteria.Under appropriate conditions, we derive the form of this bias and propose a simple correction.

Background
The Akaike information criterion (AIC), introduced by Hirotugu Akaike in his seminal 1973 publication "Information Theory and an Extension of the Maximum Likelihood Principle", has emerged as a pivotal tool in statistical model selection.To appreciate the evolution of the AIC framework, it is important to recall that the likelihood function encapsulates the fidelity of a model to the observed data.Consequently, it serves as a potent instrument for discerning the propriety of different candidate models.However, the adherence of the likelihood function to the data increases in tandem with the complexity of the model.If models are assessed based on the magnitude of the empirical likelihood, higher-dimensional models will invariably be favored, thus potentially culminating in the selection of models with excessively complex structures.Therefore, to employ the maximum likelihood framework for model selection, a recalibration of the likelihood function is necessary so that the favorability of complexity is conferred solely when warranted.This essential adaptation is achieved by AIC.
For any prospective model parameterized by θ, the AIC is given by where k denotes the dimension of θ, and θ denotes the maximum likelihood estimate of the parameter vector θ.The statistic −2ℓ( θ|y) is known as the "goodness-of-fit" term, and represents the log of the empirical likelihood for the model parameterized by θ based on the data y .Ergo, this term inherently gravitates towards models exhibiting intricate configurations, thus conceivably engendering a predisposition towards complexity.Counterbalancing this predilection, the term 2k is known as the "penalty", which increases as the model complexity escalates.In the pursuit of selecting an optimal model from a candidate collection, the objective is to choose the model exhibiting the minimal AIC value.This AIC value represents an equilibrium between the "goodness-of-fit" term and the "penalty" term.To elucidate this point, a model of pronounced intricacy is able to assimilate the subtleties inherent in the data, consequently resulting in a diminished "goodness-of-fit" term.However, the associated complexity results in a high "penalty" term, which means the more complex models may not necessarily produce the smallest AIC value.
In the traditional linear regression setting, consider a vector of outcomes y that we aim to model using a set of (p − 1) explanatory variables plus an intercept term, represented by the rank-p design matrix X.Let θ = β T σ 2 T , where β T = β 1 • • • β p denotes the corresponding regression coefficients, and let M = {M 1 , M 2 , . . ., M i , . . ., M 2 (p−1) } represent the set of all conceivable candidate models that can be constructed from the (p − 1) variables.Additionally, let AIC i denote the value of AIC associated with the i th model M i ∈ M, and let AIC min denote the minimum AIC value across all the models in the candidate set M. In other words, we have that If we then let ∆ i = AIC i − AIC min , the Akaike weight for model M i is defined as

Akaike Weights vs. Model Probabilities
Akaike weights have been frequently utilized as approximations to model selection probabilities.At first glance, this seems like a reasonable practice.For instance, Equation (2) has a similar structure to the Bayesian posterior model probability approximated via the Bayesian information criterion (BIC); see, for instance, the work of Neath and Cavanaugh [7].However, although Burnham and Anderson advocate for the use of Akaike weights as approximations for model selection probabilities, they also recognize that such an approximation may be crude [6].Through the following example, we argue that Akaike weights do not serve as adequate approximations to selection probabilities.Consider a simple case where we have a null and an alternative model represented by M 0 and M 1 , respectively.Then, the corresponding Akaike weights are given by w 0 = e −0.5∆ 0 e −0.5∆ 0 + e −0.5∆ 1 and w 1 = e −0.5∆ 1 e −0.5∆ 0 + e −0.5∆ 1 .
However, the Akaike weights can be rewritten as follows: Similarly, . Now, note the following: This derivation implies that if the Akaike weights are treated as actual model choice probabilities, then for models M i and M j such that i ̸ = j, the log of the corresponding relative risk is proportional to AIC i − AIC j .However, this property is erroneous, in that the relative risk based on the actual model selection probabilities will not satisfy this proportionality.
To illustrate this incongruity, consider the case where M 0 and M 1 are parameterized by θ 0 and θ 1 , respectively, but |θ 1 | − |θ 0 | = 1; in other words, the dimension of the parameter vectors only differs by one.In addition, assume that the null is the true data-generating model.Note that in this case, ∆ 0 = 0 =⇒ AIC 0 − AIC 1 ≤ 0; therefore, if we let P(M 0 ) be the selection probability for model M 0 , then P(M 0 ) is equivalent to P(AIC 0 − AIC 1 ≤ 0).Under a true null model, we can write This means that asymptotically, the relative risk of M 0 over M 1 is equal to For the sake of argument, suppose that w 0 and w 1 can serve as approximations to P(M 0 ) and P(M 1 ).However, note that Since the ratio (w 0 /w 1 ) can never attain the actual value of the relative risk, the Akaike weights should not be treated as approximations to the probability of choosing a particular model.

Model Frequencies as Multinomial Probability Vector Approximations
Consider a scenario in which we can draw a set of S samples from the true datagenerating model, denoted by g.For each sample s ∈ S, we evaluate the quantity Now, it is worth noting that if M i emerges as the optimal model for sample s, then AIC i = AIC min , which implies that ∆ i = 0 for such a sample.Subsequently, we define a random vector τ T = (τ 1 , τ 2 , . . ., τ 2 (p−1) ) to denote the number of selections for each model across all S draws.Therefore, where ∆ i (s) denotes the AIC difference ∆(i) for sample s.
Given this configuration, we can assert that τ ∼ multinomial(S, π), where S represents the number of trials, and π T = (π 1 , π 2 , . . ., π 2 (p−1) ) constitutes a vector of probabilities.Specifically, π i corresponds to P(M i ), the probability of selecting model M i .Furthermore, since model M i is exclusively chosen when ∆ i = 0, we deduce that π i = P(∆ i = 0).Finally, we obtain: Thus, we can define model selection probabilities as the probability vector π from a multinomial distribution with S number of trials, where each trial s ∈ S is a sample from the true data generating model.Therefore, a model selection probability approximation should be an attempt to estimate π.For this task, we will employ the non-parametric bootstrap, though we need to consider a set of important caveats delineated in Section 3.2.

Bias Induced from Bootstrapping Fitted Likelihoods
Let y T = (y 1 , . . ., y n ) denote the observation vector, which we assume is comprised of n independent variates y i .To assess the disparity between a data generating model g(y) and a parametric approximating model f (y|θ), we can employ the Kullback-Leibler information between g(y) and f (y|θ) with respect to g(y), given by known as the Kullback-Leibler discrepancy (KLD), is often used as a substitute for I KL since E g [log g(y)] does not depend on the structure of the approximating model f (y|θ).
In practice, the goal is to determine the propriety of fitted models of the form f (y| θ), where θ = argmax θ∈Θ ℓ(θ|y).The KLD for the fitted model is given by To develop bootstrap approximations to model selection probabilities, we will use the "plug-in" principle as described by Efron and Tibshirani [8].In our current application, we replace the true data generating distribution g by the empirical distribution ĝ; the original sample y by a bootstrap sample y * drawn from ĝ; and finally, θ by the maximum likelihood estimate (MLE) θ * derived under the bootstrap sample y * .For clarification purposes, it is worth emphasizing that E g (•) is the expected value with respect to the data generating distribution, while E * (•) is the expected value with respect to the bootstrap distribution.In fact, any notation presented with the symbol * will correspond to a bootstrap-based construct.For example, the bootstrap version of AIC is defined as With the preceding replacements, the bootstrap version of the KLD is given by where ℓ i represents the contribution to the likelihood based on the ith response y i .The work by Dajles and Cavanaugh [9] shows that where k is the dimension of the model.With these results, we can provide the following lemma.
Lemma 1.Consider a large-sample setting and assume that the candidate model parameterized by θ subsumes the true model.Then, for the corresponding empirical log-likelihood given by ℓ( θ|y), we have that where k = |θ| and E * is the expectation with respect to the bootstrap distribution.
Proof.Let d(g, θ) as defined in (3) denote the KLD for the fitted candidate model.Now, consider the extended information criterion (EIC) [10] for this fitted model, which is given by Under the assumption that the candidate model subsumes the true model, it has been shown that EIC is an asymptotically unbiased estimator of the expected KLD, which means that E g (d(g, θ)) ≈ E g (EIC) [10].Therefore, by applying the expected value under the generating model g to the expression for EIC, we obtain the following: 4).Now, under the same modeling specification assumption, we know that AIC is also an asymptotically unbiased estimator of the expected KLD.Therefore, Recall the approximation (5), which states that Hence, we have that The preceding establishes the lemma.
The results from Lemma 1 indicate that if we want to build a bootstrap distribution of AIC variates for which the expected value of each variate is approximately unbiased for the expected value of the fitted KLD, then we must adjust the variates by k.Moreover, the results from Lemma 1 in combination with the expression (5) reveal a deeper understanding of the behavior of fitted log-likelihoods when subjected to a bootstrap procedure.
For the purpose of presenting parallel notation, employing the trivial result that −2ℓ( θ|y) = E * −2ℓ( θ|y) , we have the following asymptotic results: Equations ( 6)-( 8) exhibit the intricate connection between the penalty term and the level of optimism conveyed by fitted log-likelihoods in the context of the bootstrap.In Equation ( 6), the estimate of the parameter θ is based on the bootstrap data, and the likelihood evaluates the efficacy of the fitted model parameterized by θ * in predicting the original sample y.This corresponds to the classical setting in which the bootstrap is used to assess the predictive accuracy of a fitted model.Since some of the elements of y will not be represented in y * , one can think of this setting as representing pseudo out-of-sample prediction.Here, the penalty term should be k, the number of parameters in the model.Now, when we look at Equation ( 7), the likelihood evaluates the efficacy of the fitted model parameterized by θ in predicting the sample y that was used to obtain θ * .The strong dependence of θ and y results in overly optimistic prediction, thus justifying the need to increase the penalty to 2k.Analogously, when we look at Equation ( 8), the likelihood evaluates the efficacy of the fitted model parameterized by the bootstrap replicate θ * in predicting the bootstrap sample y * used to obtain θ * .The strong dependence of θ * and y * again results in overly optimistic prediction.Moreover, the optimism is increased by the duplicated elements in the fitting sample y * , i.e., there are more independent pieces of information in y than in y * .Here, the penalization needs to be increased to 3k.

Bootstrap Approximation of Model Selection Probabilities
In Section 3.1, we introduced the idea of the model selection probability as an entry of a probability vector in a multinomial distribution.With this framework, we have that where E * is the expected value with respect to the bootstrap distribution.Here, which serves as a bootstrap approximation of P(M i ).

Akaike Weights vs. Bootstrap Model Frequencies: Simulation
To better understand the behavior of Akaike weights vs. BMFs, consider the true data-generating model given by y i = β 0,1 + ϵ i , with ϵ i ∼ N(0, 1) and β 0,1 = 1.Now, suppose the null and alternative candidate models under consideration have the following form: where x i2 ∼ N(0, 1) and η i ∼ N(0, σ 2 ).Thus, the null model is correctly specified and the alternative is overspecified.For 1000 bootstrap iterations and 500 samples from the data-generating model, each of size N = 20, we can compute the Akaike weights and the AIC-based bootstrap model frequencies for M 0 and M 1 .The results are displayed in Figure 1.Here, the discrepancy between Akaike weights and BMFs is noticeable.The red, vertical dotted lines in Figure 1 represent the boundaries associated with the Akaike weights.More precisely, similarly to the log of the relative risk, the Akaike weights are bounded by a function of the minimum difference between the empirical log-likelihoods of the null and the alternative models.In other words, since −2(ℓ( θ0 |y) − ℓ( θ1 |y)) > 0, then w 0 = 1 1 + e 0.5(AIC 0 −AIC 1 ) < 1 1 + e 0.5(−2) = 0.7311, and similarly,  Table 1 shows the average of the Akaike weights and the BMFs over all 500 simulated data sets.Herein, we see that the average Akaike weights and the average BMFs are not close to each other.However, although not identical in value, the average BMF is close to the empirical estimate of the true model probability.For these 500 data sets, we have that P(M 0 = 0) ≈ 0.8440 and 500 ∑ s=1 P * (∆ * 0 (s) = 0) 500 ≈ 0.7566.
Conversely, the average Akaike weight for M 0 is 0.6166, which is far from the empirical estimate of P(M 0 ).

More Complex Simulation Scenarios
The setting explored in Section 3.4 provides an intuitive and illustrative example of the behavior of Akaike weights and BMFs and the differences between them.However, one can consider other more complex simulation scenarios, though perhaps at the expense of mathematical clarity.
For instance, consider the data generating model given by with ϵ i ∼ N(0, 1), and where µ = {1, 1, 1} and Σ =   1 0.5 0.5 0.5 1 0.5 0.5 0.5 1 In this setting, of the eight possible candidate models, only two models subsume the correct mean structure, while the remaining models are underspecified.Moreover, to test the versatility of BMFs, we also introduce some correlation between the explanatory variables.
Table 2 shows the performance one would expect given the theoretical underpinnings that we have developed and presented.In approximating the model probabilities for the small sample size, the performance of the average BMF is not ideal but is still superior to that of the Akaike weights.However, as the sample size increases, we observe a significant improvement in BMF performance.For the sample size of N = 1000, we note that the average BMFs are sufficiently close to their respective model selection probabilities.Now, consider a more complex case with the data generating model given by • N(0, 1), Z ∼ Bernoulli(π), π = 0.8, and In this setting, assuming that the models are fit using a Gaussian log-likelihood, each candidate model is misspecified with respect to the error distribution.With regard to the mean structure, two models subsume the correct structure, while the remaining models are underspecified.Table 3 illustrates that for a sufficiently large sample size (N = 1000), the adjusted bootstrap model frequencies (BMFs) outperform the Akaike weights.As the sample size decreases, we observe that the BMFs diverge from the simulated model selection probabilities.
The performance observed in the large sample size scenario aligns with theoretical expectations.For any information criterion based on empirical log-likelihoods, if the sample size is sufficiently large relative to the number of variables in the candidate models, the goodness-of-fit term will predominate and favor the selection of larger models that subsume the appropriate mean structure.Since the true model selection probabilities are calculated via AIC, model misspecification is likely to similarly affect both the BMFs and the simulated true model probabilities.However, keep in mind that for this setting, AIC is not guaranteed to provide an asymptotically unbiased estimator of the KLD, which means that the simulated true model selection probabilities will not be correctly calibrated.
As the sample size decreases, the penalty term becomes more influential, and the effects of violations to the regularity conditions under which AIC is asymptotically justified become more pronounced.Therefore, in small sample size settings, we expect to see greater discrepancies in model selection between the simulated samples and the bootstrap samples.Precisely elucidating these discrepancies is challenging and would require a rigorous mathematical exploration of the effect of misspecification on the performance of AIC as an asymptotically unbiased estimator of the expected KLD.To avoid mathematical ambiguities, it is best to consider a simple case as demonstrated in Section 3.4.(3) Identify the model M i that corresponds to the minimum information criterion value.
In other words, choose model M i such that where IC * (M i ) is the bootstrap version of IC(M i ).(4) Record βj * , the maximum likelihood estimate (MLE) of each β j corresponding to the covariates represented in the model selected from step 3.The MLEs for unselected variables are set to zero.(5) Repeat the aforementioned steps B times.
If we let y * (b) be a bootstrap sample from y, with b ∈ {1, 2, . . ., B}, and let β * j (b) be the estimate of the regression coefficient β j for the b th bootstrap sample under the model selected for that specific sample, the procedure described above results in a set of estimates given by { β * j (1), β * j (2), . . ., β * j (B)}.The smoothed estimator of βj is defined as For the confidence interval based on the smoothed estimator, Efron [11] proposes the following approach.Suppose that we have a sample y = y 1 y 2 . . .
to be the number of elements of y * (b) equaling the original data point y i , then the smoothed standard error of β * j can be estimated by However, bootstrap applications do not typically consider the "ideal" case with all possible N N bootstrap samples.For the "non-ideal" setting with B bootstrap samples and B not necessarily equal to N N , we can obtain an estimate of the smoothed standard error via , where Note that this estimator solely relies on {y * (1), y * (2), . . ., y * (B)}, which corresponds to the set of bootstrap samples used to estimate β * j .A 95% confidence interval (CI) for the smoothed estimator is called the smoothed 95% CI and it is given by β * j ± 1.96 sd jB .

Application in Biomedicine: Sulindac for the Treatment of Colonic and Rectal Ademonams in Patients with Familial Adenomatous Polyposis
Familial adenomatous polyposis (FAP) is an autosomal dominant genetic disease characterized by the development of thousands of polyps throughout the colon and rectum.This condition is rare; it occurs in 1 in 1000 people, and although it accounts for only 1% of all diagnosed colorectal cancers, it is the second most common inherited colorectal cancer syndrome.
FAP is the result of a mutation of a tumor suppressor gene on chromosome 5 known as the ademomatous polyposis coli (APC) gene.Polyps begin to arise in the early teens and if untreated, patients have a 100% lifetime risk of developing colorectal cancer.In addition, patients with FAP are at risk of developing extracolonic pathologies such as desmoid tumors (a solid connective tissue tumor), hepatoblastomas (liver tumors), and thyroid cancer [12].
In patients with FAP, early detection and treatment are essential for preventing the development of colorectal carcinoma, thus improving the prognosis.The standard treatment for FAP is colectomy with or without protectomy.Subtotal colectomy is desirable for many patients, but it requires continued surveillance, while total proctocolectomy does not require surveillance, but patients experience increased stool urgency and higher rates of urinary dysfunction.Therefore, there is a need for the development of non-surgical treatments for patients with FAP [13,14].
In 1993, the effect of Sulindac, a non-steroidal anti-inflammatory drug (NSAID), was investigated for the treatment of FAP in a randomized clinical trial [15].The study recruited 22 patients with FAP, 11 of whom were assigned to the treatment group that received 150 mg dosages of Sulindac, and 11 of whom were assigned to control group that received an identical-appearing placebo tablet.The study also considered the sex and age of the patients.
The results of the study were published in the paper titled "Treatment of Colonic and Rectal Adenomas with Sulindac in Familial Adenomatous Polyposis" in The New England Journal of Medicine [15].The data set is available in the R package "medicaldata" [16].
The main outcome of interest in the study is the proportionate difference in the number of polyps at 3 months compared to baseline.In other words, if we define D as the proportionate change in the number of polyps, then we have that In addition to this outcome, the explanatory variables in the data set include treatment (a value of 1 for Sulindac and 0 for the placebo), sex (a value 1 for male and 0 for female), and age.Moreover, in our modeling analysis, we consider the interaction between treatment and sex, and we define this variable as interaction = treatment • sex.
The primary objective of this application is to assess the selection probability of different linear models that can be constructed with the variables collected from the study.To maintain consistency with the original publication [15], all the candidate models are fitted using least squares regression; an inspection of the distribution of the residuals for the full model confirms that this is a reasonable approach.Ultimately, we will compare the model selection probability estimates that are obtained by utilizing the Akaike weights, the unadjusted bootstrap model frequencies and the bias-adjusted bootstrap model frequencies.The results are displayed in Table 4.In analyzing the results of this example, several important points emerge.First, it is worth emphasizing that all estimating approaches consistently point towards the model incorporating both the treatment and sex variables as the most likely.This finding is particularly reassuring, as each approach hinges on model selections through AIC.
However, a notable discrepancy arises in the degree to which this favored model is endorsed as reflected by the disparities between the Akaike weights and the BMFs.The Akaike weights suggest a model selection probability of 46.4%, whereas both the adjusted and unadjusted BMFs indicate a higher probability, hovering around 66% each.This distinction carries practical implications, especially in the context of estimating treatment effects and standard errors within the model averaging framework.A higher probability assigned to a single model implies a concentration of estimates from that model, leading to reduced variability due to model selection.
Further scrutiny reveals distinctions between the unadjusted and bias-adjusted BMFs in their estimation of probabilities for the second most popular model.Specifically, for the model containing treatment, sex, and the interaction, the unadjusted BMF yields an estimated probability of 12.3% (in proximity to the Akaike weight for the same model).In contrast, the adjusted version provides a markedly lower estimate of 0.075%.However, for the model involving only treatment, the adjusted approach yields a probability estimate of 16.2%.
To better appreciate the practical effects of these differences in model probability estimation, consider the results on Tables 5 and 6, which show that the smoothed CIs for the adjusted BMFs are narrower than for the unadjusted case.More importantly, the length of the smoothed CI for treatment is noticeably narrower for the adjusted setting.This occurs because, as noted previously, the second most popular model for the adjusted case is the model that only contains the treatment variable.In other words, 16.2% of the contributions to the smoothed estimate of treatment come from a model that allocates all the data to estimating the effect of treatment.On the other hand, for the unadjusted case, 12.3% of the contributions to the smoothed estimate of treatment come from a model that allocates the data to estimating treatment, sex, and the interaction term.
From an information theoretic point of view, this dissimilarity can be explained by considering the optimism inherent in fitted log-likelihoods.As established in Section 3.2, the fitted log-likelihood given by −2ℓ( θ * |y * ) serves as an overly optimistic measure of goodness-of-fit.Thus, to circumvent the unrealistic selection of complex models, the addition of a penalty term of size 2k (as employed in AIC) proves insufficient.Instead, an extra k must be incorporated to strike a proper balance between the goodness-of-fit term and the penalty term.Since the unadjusted BMF lacks this extra penalty, it is expected to tend to favor larger models than the adjusted BMF.

Conclusions
The current practice of statistical modeling recognizes that no single model can capture all of the features of the true data generating model, and that the uncertainty inherent in selecting a model from a candidate collection should ideally be accommodated in the development of inferential procedures.Statisticians have therefore proposed estimating the model parameters of interest by incorporating the contributions of multiple possible candidate models.However, each model's contribution must be proportional to its probability of being selected.
In this paper, we showed that the AIC-based bootstrap model frequency (BMF) serves as a reasonable approximation to the corresponding AIC-based model selection probability.However, we established that in the process of computing these BMFs, we must add an extra penalty to the bootstrap AIC values of size equal to the number of parameters in the candidate model.This penalty ensures that the bootstrap AIC will remain an asymptotically unbiased estimator of the expected value of the Kullback-Leibler discrepancy.Finally, we showed that the BMFs serve as better approximations to model selection probabilities than the commonly used Akaike weights.We exhibited the superior performance of BMFs through a simulation study, and illustrated their use relative to Akaike weights in a real data application.

Figure 1 .
Figure 1.AIC-based bootstrap model frequencies vs. Akaike weights for a null and an alternative model denoted by M 0 and M 1 , respectively.The simulation generates 500 data sets from the null model, which is assumed to be true, and each data set is of size N = 20.The BMFs are computed with 1000 bootstrap samples.

Table 1 .
Average of the bootstrap model frequencies and the Akaike weights for M 0 and M 1 over all 500 simulated data sets.

Table 2 .
Results of a simulation with properly specified errors.The simulation consists of 100 data sets from the data generating model, with varying sample sizes.The true model selection probabilities are estimated with the 100 simulated data sets.The BMFs are calculated with 100 bootstraps.The BMFs and Akaike weights are averaged over all 100 data sets.

Table 4 .
Akaike weights, unadjusted, and adjusted bootstrap model frequencies (BMFs) for models fit to the FAP data set.The BMFs are calculated using 1000 bootstrap samples.Zeros and ones are used to denote the absence or presence of the variable in a selected model.The intercept is included in each model.

Table 5 .
Smoothed estimates and smoothed 95% confidence intervals for regression coefficients from analysis of FAP data set.The results include the bias-adjusted and unadjusted values.

Table 6 .
Lengths of smoothed 95% confidence intervals from analysis of FAP data set.